In the following exercises, we will use the data you collected in the previous session (all comments for the video “The Census” by Last Week Tonight with John Oliver). Please note that your results might look slightly different from the output in the solutions for these exercises, as we collected the comments earlier.
You might have to adjust the following code to use the correct file path on your computer.
comments <- readRDS("../data/ParsedComments.rds")
Next, we go through the preprocessing steps described in the slides. First, we remove newline characters from the comment strings (the version without emojis).
library(tidyverse)
comments <- comments %>%
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))
Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.
library(quanteda)
toks <- comments %>%
  pull(TextEmojiDeleted) %>%
  char_tolower() %>%
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         remove_hyphens = TRUE,
         remove_url = TRUE)
comments_dfm <- dfm(toks,
                    remove = quanteda::stopwords("english"))
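Removing stopwords simply drops common function words from the feature set before counting. A base-R sketch of the idea with a tiny hypothetical stopword list (the real call uses the full `quanteda::stopwords("english")` list):

```r
# Toy tokens and a tiny, hypothetical stopword list
tokens_toy <- c("the", "census", "is", "important")
stopwords_toy <- c("the", "is", "a")

# Keep only tokens that are not in the stopword list
tokens_kept <- tokens_toy[!tokens_toy %in% stopwords_toy]
tokens_kept
```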
Which words occur most frequently in the comments? We can use the function textstat_frequency() from the quanteda package to answer this question and store the result in an object called term_freq.
term_freq <- textstat_frequency(comments_dfm)
head(term_freq, 20)
## feature frequency rank docfreq group
## 1 census 1786 1 1361 all
## 2 people 1013 2 744 all
## 3 just 767 3 664 all
## 4 like 629 4 532 all
## 5 one 526 5 438 all
## 6 trump 520 6 462 all
## 7 can 494 7 433 all
## 8 know 453 8 403 all
## 9 john 442 9 410 all
## 10 get 441 10 391 all
## 11 question 396 11 331 all
## 12 government 389 12 312 all
## 13 many 370 13 320 all
## 14 us 369 14 301 all
## 15 citizens 366 15 270 all
## 16 think 299 16 274 all
## 17 even 290 17 270 all
## 18 country 287 18 239 all
## 19 illegal 279 19 216 all
## 20 want 278 20 239 all
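Conceptually, textstat_frequency() reports two counts per feature: how often it occurs overall (frequency) and in how many documents it occurs at least once (docfreq). A base-R sketch of both counts for a set of toy tokenized comments:

```r
# Toy tokenized comments: one character vector per document (hypothetical data)
docs <- list(c("census", "people", "census"),
             c("people", "like"),
             c("census"))

# Overall frequency: count all token occurrences across documents
frequency <- table(unlist(docs))

# Document frequency: count the documents in which each token appears at least once
docfreq <- table(unlist(lapply(docs, unique)))

frequency[["census"]]  # occurs 3 times in total
docfreq[["census"]]    # occurs in 2 of the 3 documents
```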
Which words appear in the highest number of different comments? To answer this, we can sort by the docfreq column from the term_freq object you created in the previous task.
term_freq %>%
  arrange(-docfreq) %>%
  head(10)
## feature frequency rank docfreq group
## 1 census 1786 1 1361 all
## 2 people 1013 2 744 all
## 3 just 767 3 664 all
## 4 like 629 4 532 all
## 5 trump 520 6 462 all
## 6 one 526 5 438 all
## 7 can 494 7 433 all
## 8 john 442 9 410 all
## 9 know 453 8 403 all
## 10 get 441 10 391 all
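Note that sorting by docfreq can change the order relative to the overall frequency ranking (here, "trump" moves ahead of "one"). The arrange(-docfreq) call sorts in descending order; the same reordering can be sketched in base R with order(), using a toy two-row version of the term_freq table:

```r
# Toy version of the term_freq data frame (values taken from the output above)
term_freq_toy <- data.frame(feature   = c("one", "trump"),
                            frequency = c(526, 520),
                            docfreq   = c(438, 462))

# Sort rows by descending document frequency
sorted <- term_freq_toy[order(-term_freq_toy$docfreq), ]
sorted$feature  # "trump" now comes first despite its lower overall frequency
```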
We also want to look at the emojis that were used in the comments on the video “The Census” by Last Week Tonight with John Oliver. Similar to what we did for the comment text without emojis, we first need to wrangle the data (remove missing values, tokenize the emojis, create a DFM).
emoji_toks <- comments %>%
  mutate_at(c("Emoji"), list(~na_if(., "NA"))) %>%
  mutate(Emoji = str_trim(Emoji)) %>%
  filter(!is.na(Emoji)) %>%
  pull(Emoji) %>%
  tokens()
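The mutate_at() call turns the literal string "NA" into a proper missing value, so that filter(!is.na(Emoji)) can then drop comments without emojis. A base-R sketch of the same two steps on a toy emoji column:

```r
# Toy emoji column: comments without emojis are stored as the string "NA" (hypothetical data)
emoji <- c("emoji_fire", "NA", "emoji_toilet", "NA")

# Turn the literal string "NA" into a real missing value (what na_if() does)
emoji[emoji == "NA"] <- NA

# Drop the missing values (what filter(!is.na(...)) does)
emoji_clean <- emoji[!is.na(emoji)]
emoji_clean
```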
EmojiDfm <- dfm(emoji_toks)
EmojiFreq <- textstat_frequency(EmojiDfm)
head(EmojiFreq, n = 10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 115 1 68 all
## 2 emoji_rollingonthefloorlaughing 37 2 21 all
## 3 emoji_thinkingface 30 3 19 all
## 4 emoji_registered 14 4 4 all
## 5 emoji_grinningfacewithsweat 13 5 11 all
## 6 emoji_fire 12 6 3 all
## 7 emoji_grinningsquintingface 11 7 7 all
## 8 emoji_unamusedface 9 8 9 all
## 9 emoji_facewithrollingeyes 8 9 8 all
## 10 emoji_toilet 8 9 5 all
EmojiFreq %>%
  arrange(-docfreq) %>%
  head(10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 115 1 68 all
## 2 emoji_rollingonthefloorlaughing 37 2 21 all
## 3 emoji_thinkingface 30 3 19 all
## 4 emoji_grinningfacewithsweat 13 5 11 all
## 5 emoji_unamusedface 9 8 9 all
## 6 emoji_facewithrollingeyes 8 9 8 all
## 7 emoji_grinningsquintingface 11 7 7 all
## 8 emoji_thumbsup 7 11 7 all
## 9 emoji_smilingfacewithsmilingeyes 6 14 6 all
## 10 emoji_manshrugging 6 14 6 all
To display the emojis themselves in the following plot, we use a custom mapping function. You can have a look at the emoji_mapping_function.R file to see what this function does.
source("../scripts/emoji_mapping_function.R")
create_emoji_mappings(EmojiFreq, 10)
head(EmojiFreq, n = 10) %>%
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat = "identity",
           color = "black",
           fill = "#FFCC4D") +
  geom_point() +
  labs(title = "Most frequent emojis in comments",
       subtitle = "The Census: Last Week Tonight with John Oliver (HBO)
       \nhttps://www.youtube.com/watch?v=1aheRpmurAo&t=33s",
       x = "",
       y = "Frequency") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 150)) +
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  mapping1 +
  mapping2 +
  mapping3 +
  mapping4 +
  mapping5 +
  mapping6 +
  mapping7 +
  mapping8 +
  mapping9 +
  mapping10
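The reorder(feature, -frequency) call in the plot sorts the bars from the most to the least frequent emoji. A base-R sketch of what reorder() does to the factor levels, using three values from the EmojiFreq output above:

```r
# Three emojis and their frequencies, taken from the EmojiFreq output above
feature   <- c("emoji_thinkingface", "emoji_facewithtearsofjoy", "emoji_rollingonthefloorlaughing")
frequency <- c(30, 115, 37)

# Reorder the factor levels by descending frequency, as done on the x-axis of the plot
f <- reorder(factor(feature), -frequency)
levels(f)  # most frequent emoji comes first
```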